Summary. We review three studies of information flow in social networks that help
reveal their underlying social structure, how information spreads among them and 
why small world experiments work. 



1 Introduction 

The problem of information flows in social organizations is relevant to issues 
of productivity, innovation and the sorting out of useful ideas from the gen- 
eral chatter of a community. How information spreads determines the speed 
with which individuals can act and plan their future activities. Moreover, in- 
formation flows take place within social networks whose nature is sometimes 
difficult to establish. This is because the network itself is sometimes different 
from what one would infer from the formal structure of the group or organi- 
zation. 

The advent of email as the predominant means of communication in the 
information society now offers a unique opportunity to observe the flow of 
information along both formal and informal channels. Not surprisingly, email 
has been established as an indicator of collaboration and knowledge exchange 
[51, 52, 22, 46, 15]. Email is also a good medium for research because it 
provides plentiful data on personal communication in an electronic form. This 
volume of data enables the discovery of shared interests and relationships 
where none were previously known [41]. 

In this chapter we will review three studies that utilized networks exposed 
by email communication. In all three studies, the networks analyzed were 
derived from email messages sent through the Hewlett Packard Labs email 
server over the period of several months in 2002 and 2003. The first study, by 
Tyler et al. [46], develops an automated method applying a betweenness cen- 
trality algorithm to rapidly identify communities, both formal and informal, 
within the network. This approach also enables the identification of leadership 
roles within the communities. The automated analysis was complemented by
a qualitative evaluation of the results in the field. 

The second study, by Wu et al. [54] analyzes email patterns to model 
information flow in social groups, taking into account the observation that an 
item relevant to one person is more likely to be of interest to individuals in 
the same social circle than those outside of it. This is due to the fact that the 
similarity of node attributes in social networks decreases as a function of the 
graph distance. An epidemic model on a scale-free network with this property 
has a finite threshold, implying that the spread of information is limited. 
These predictions were tested by measuring the spread of messages in an 
organization and also by numerical experiments that take into consideration 
the organizational distance among individuals. 

Since social structure affects the flow of information, knowledge of the 
communities that exist within a network can also be used for navigating the 
networks when searching for individuals or resources. The study by Adamic
and Adar [1] does just this by simulating Milgram's small world experiment
on the HP Labs email network. The small world experiment has been carried 
out a number of times over the past several decades, each time demonstrat- 
ing that individuals passing messages to their friends and acquaintances can 
form a short chain between two people separated by geography, profession, 
and race. While the existence of these chains has been established, how peo- 
ple are able to navigate without knowing the complete social networks has 
remained an open question. Recently, models have been proposed to explain 
the phenomenon, and the work of Adamic and Adar is a first study to test 
the validity of these models on a social network. 

2 Email as Spectroscopy 

Communities of practice are the informal networks of collaboration that natu- 
rally grow and coalesce within and outside organizations. Any institution that 
provides opportunities for communication among its members is eventually 
threaded by communities of people who have similar goals and a shared
understanding of their activities [38]. These communities have been the subject
of much research as a way to uncover the reality of how people find informa- 
tion and execute their tasks, (for example, see [6, 8, 48], or for a survey see 
[42]). 

These informal networks coexist with the formal structure of the organization
and serve many purposes, such as resolving the conflicting goals of
the institution to which they belong, solving problems in more efficient ways 
[24], and furthering the interests of their members. Despite their lack of offi- 
cial recognition, informal networks can provide effective ways of learning, and 
with the proper incentives actually enhance the productivity of the formal 
organization [10, 9, 29]. 

Recently, there has been an increased amount of work on identifying communities
from online interactions (a brief overview of this work can be found
in [51]). Some of this work finds that online relationships do indeed reflect
actual social relationships, thus adding effectively to the "social capital" of
a community. Ducheneaut and Bellotti [13] conducted in-depth field studies
of email behavior, and found that membership in email communities is quite 
fluid and depends on organizational context. Mailing lists and personal web 
pages also serve as proxies for social relationships [2], and the communities 
identified from these online proxies resemble the actual social communities of 
the represented individuals. Because of the demonstrated value of communi- 
ties of practice, a fast, accurate method of identifying them is desirable. 

Classical practice is to gather data from interviews, surveys, or other field- 
work and to construct links and communities by manual inspection (see [5, 23] 
or an Internet-centric approach in [20]). These methods are accurate but time- 
consuming and labor-intensive, prohibitively so in the context of a very large 
organization. Alani et al. [4] recently introduced a semi-automated utility that 
uses a simple algorithm to identify nearest neighbors to one individual within 
a university department. 

The method of Tyler et al. [46] uses email data to construct a network 
of correspondences, and then discovers the communities by partitioning this 
network. It was applied to a set of over one million email messages collected 
over a period of roughly two months at HP Labs in Palo Alto, an organization 
of approximately 400 people. The only pieces of information used from each 
email are the names of the sender and receiver (i.e., the "to:" and "from:" 
fields), enabling the processing of a large number of emails while minimizing 
privacy concerns. 

The method was able to identify small communities within the organiza- 
tion, and the leaders for those communities, in a matter of hours, running on 
a standard Linux desktop PC. This experiment was followed by a qualitative 
evaluation of the experimental results in the "field", which consisted of sixteen
face-to-face interviews with individuals in HP Labs. The interviews validated 
the results obtained by the automated process, and provided interesting per- 
spectives on the communities identified. We describe the results in more detail 
below. 

2.1 Identifying Communities 

It is straightforward to construct a graph based on email data, in which ver- 
tices represent people and edges are added between people who exchanged at 
least a threshold number of email messages. Next, one can identify commu- 
nities: subsets of related vertices, with many edges connecting vertices of the 
same subset, but few edges lying between subsets [21]. 

The method of Wilkinson and Huberman [53], related to the algorithm
of Girvan and Newman [21], partitions a graph into discrete communities of
nodes and is based on the idea of betweenness centrality, or betweenness,
first proposed by Freeman [18]. The betweenness of an edge is defined as the
number of shortest paths that traverse it. This property distinguishes
inter-community edges, which link many vertices in different communities and have
high betweenness, from intra-community edges, whose betweenness is low.

Fig. 1. An example graph with edge AB having high betweenness.

To illustrate the community discovery process, consider the small graph 
shown in Figure 1. This graph consists of two well-defined communities: the 
four vertices denoted by squares, including vertex A, and the nine denoted 
by circles, including vertex B. Edge AB has the highest betweenness, because 
all paths between any circle and square must pass through it. If one were to 
remove it, the squares and circles would be split into two separate communi- 
ties. The algorithm of Wilkinson et al. repeatedly identifies inter-community 
edges of large betweenness such as AB and removes them, until the graph is 
resolved into many separate communities. 



Because the removal of an edge strongly affects the betweenness of many 
others, the values were repeatedly updated with the fast algorithm of Bran- 
des [7, 36, 21]. The procedure stops removing edges when it cannot further 
meaningfully subdivide communities. Figure 2 shows the smallest possible 
component that can be subdivided into two viable subcommunities. It has 6 
nodes, consisting of two triangles linked by one edge. A component with fewer 
than 6 nodes cannot be subdivided further. 

Components of size ≥ 6, for example the group of size nine in Figure
1, can also constitute single cohesive communities. Figure 3 shows how the
algorithm determines when to stop subdividing a community. The edge XY
has the highest betweenness, but removing it would separate a single node,
which does not constitute a viable community. In general, the single edge
connecting a leaf vertex (such as X in Figure 3) to the rest of a graph of N
vertices has a betweenness of N − 1, because it contains the shortest path from
X to all N − 1 other vertices. The stopping criterion for components of size ≥
6 is therefore that the highest betweenness of any edge in the component be
equal to or less than N − 1.

Fig. 2. The smallest possible graph of two viable communities.
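As a quick check of the criterion, the counts below work through the graphs described in the text. They assume, as the figure descriptions state, that every square-to-circle path in Figure 1 crosses AB and that Figure 2 is two triangles joined by a single edge; each vertex pair is counted once, and b(·) is shorthand introduced here for edge betweenness.

```latex
% Figure 1: 4 squares and 9 circles, N = 13
b(AB) = 4 \times 9 = 36 \;>\; N - 1 = 12
  \quad\Rightarrow\quad \text{AB is removed}

% Figure 2: two triangles joined by one edge, N = 6
b(\text{bridge}) = 3 \times 3 = 9 \;>\; N - 1 = 5
  \quad\Rightarrow\quad \text{the bridge is removed}

% Edge joining a leaf vertex X to the rest of a graph of N vertices
b(\text{leaf edge}) = 1 \times (N - 1) = N - 1
  \quad\Rightarrow\quad \text{at the threshold, so not removed}
```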




Fig. 3. An example graph of one community that does not contain distinct subcommunities.



2.2 Multiple Community Structures 

As mentioned above, the removal of any one edge affects the betweenness of 
all the other edges, particularly in large, real-world graphs such as the email 
graph. Early in the process, there are many inter-community edges which have
high betweenness and the choice of which to remove, while arbitrary, dictates 
which edges will be removed later. For example, a node belonging to two com- 
munities can be placed in one or the other by the algorithm, depending on 
the order in which edges are removed. One can take advantage of this arbi- 
trariness to repeatedly partition the graph into many different "structures" 
or sets of communities. These sets are then compared and aggregated into a 
final list of communities. 

Wilkinson and Huberman [53] introduced randomness into the algorithm 
by calculating the shortest paths from a random subset as opposed to all the 
nodes. The algorithm cycles randomly through at least m centers (where m is 
some cutoff) until the betweenness of at least one edge exceeds the threshold 
betweenness of a "leaf" vertex. The edge whose betweenness is highest at 
that point is removed, and the procedure is repeated until the graph has 
been separated into communities. The modified algorithm may occasionally 
remove an intra-community edge, but such errors are unimportant when a 
large number of structures is aggregated. 

Applying this modified process n times yields n community structures 
imposed on the graph. One can then compare the different structures and 
identify communities. For example, after imposing 50 structures on a graph, 
one might find: a community of people A, B, C, and D in 25 of the 50 struc- 
tures; a community of people A, B, C, D, and E in another 20; and one of 
people A, B, C, D, E and F in the remaining 5. This result is reported in the 
following way: A(50) B(50) C(50) D(50) E(25) F(5) which signifies that A, B, 
C, and D form a well-defined community, E is related to this community, but
also to some other(s), and F is only slightly, possibly erroneously, related to
it. For details of the aggregation procedure, please see [53].
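The tally in this example can be reproduced with a few lines of Python. This is only a sketch of the counting step, not the aggregation procedure of [53]; the function name `tally_membership`, the choice of A as the reference member, and the placement of E and F in the structures where the text does not mention them are all assumptions made for illustration.

```python
from collections import Counter

def tally_membership(structures, core):
    """Count, per person, the structures in which they share a community with `core`.

    structures: list of partitions, each partition a list of sets of people.
    """
    counts = Counter()
    for partition in structures:
        for community in partition:
            if core in community:
                counts.update(community)   # everyone grouped with `core` gets +1
                break
    return counts

# 50 hypothetical structures matching the example in the text.
structures = (
    [[{"A", "B", "C", "D"}, {"E", "F"}]] * 25        # E and F elsewhere (assumed)
    + [[{"A", "B", "C", "D", "E"}, {"F"}]] * 20      # F elsewhere (assumed)
    + [[{"A", "B", "C", "D", "E", "F"}]] * 5
)

counts = tally_membership(structures, core="A")
report = " ".join(f"{person}({counts[person]})" for person in sorted(counts))
print(report)
```

E appears with the core group in 20 + 5 = 25 structures and F in 5, which is where the E(25) and F(5) entries come from.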

The entire process of determining community structure within the graph 
is displayed below. 

• For i iterations, repeat:

  1. Break the graph into connected components.

  2. For each component, check whether the component is a community.

     - If so, remove it from the graph and output it.

     - If not, remove edges of highest betweenness, using the modified
       Brandes algorithm for large components and the normal algorithm
       for small ones. Continue removing edges until the community splits
       in two.

  3. Repeat step 2 until all vertices have been removed from the graph in
     communities.

• Aggregate the i structures into a final list of communities.
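A single pass of the procedure above might be sketched as follows. This is a simplified, deterministic illustration rather than the authors' implementation: it computes exact edge betweenness from every vertex instead of sampling random centers, it performs one iteration only (so there is nothing to aggregate), and the function names and the small tolerance in the stopping test are assumptions of this sketch.

```python
from collections import deque, defaultdict

def components(adj):
    """Split {vertex: set(neighbours)} into a list of connected vertex sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, queue = {start}, deque([start])
        while queue:
            for w in adj[queue.popleft()]:
                if w not in part:
                    part.add(w)
                    queue.append(w)
        seen |= part
        parts.append(part)
    return parts

def edge_betweenness(adj):
    """Shortest-path betweenness per undirected edge, each vertex pair counted once."""
    score = defaultdict(float)
    for s in adj:
        dist, sigma, preds, order = {s: 0}, defaultdict(float), defaultdict(list), []
        sigma[s] = 1.0
        queue = deque([s])
        while queue:                       # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):          # push path shares back towards s
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return {edge: b / 2.0 for edge, b in score.items()}

def is_community(adj):
    """Stopping rule from the text: under 6 vertices, or no edge above N - 1."""
    n = len(adj)
    return n < 6 or max(edge_betweenness(adj).values()) <= n - 1 + 1e-9

def find_communities(adj):
    """One deterministic pass; returns a list of vertex sets."""
    found = []
    todo = [{v: set(adj[v]) for v in part} for part in components(adj)]
    while todo:
        sub = todo.pop()
        if is_community(sub):
            found.append(set(sub))
            continue
        pieces = [set(sub)]
        while len(pieces) == 1:            # remove edges until the component splits
            bc = edge_betweenness(sub)
            u, v = max(bc, key=bc.get)
            sub[u].discard(v)
            sub[v].discard(u)
            pieces = components(sub)
        todo.extend({v: sub[v] for v in piece} for piece in pieces)
    return found

# Two triangles joined by one edge, as described for Figure 2.
edges = [("a1", "a2"), ("a1", "A"), ("a2", "A"), ("A", "B"),
         ("B", "b1"), ("B", "b2"), ("b1", "b2")]
graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)
communities = find_communities(dict(graph))
print(communities)
```

On the two-triangle graph the bridge A-B carries the most shortest paths, so it is removed first and each triangle is then reported as a community because it has fewer than six vertices.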

2.3 Results

The algorithm was applied to email data from the HP Labs mail server from 
the period November 25, 2002 to February 18, 2003, with 185,773 emails 
exchanged between the 485 HP Labs employees. For simplicity, emails that 
had an external origin or destination were omitted. Messages sent to a list 
of more than 10 recipients were likewise removed, as these emails were often 
lab-wide announcements (rather than personal communication), which were
not useful in identifying communities of practice. 

A graph was constructed from this data by placing edges between any two 
individuals that had exchanged at least 30 emails in total, and at least 5 in 
both directions. The threshold eliminated infrequent or one-way communica- 
tion, and eliminated some individuals from the graph who either sent very 
few emails or used other email systems. The resulting graph consisted of 367 
nodes, connected by 1110 edges. 
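The filtering and thresholding described above could be sketched as follows. The input format (a list of `(sender, recipients)` pairs), the function name `build_email_graph`, and the choice to count a multi-recipient message once per recipient are assumptions of this sketch; the text does not specify them.

```python
from collections import Counter

def build_email_graph(messages, min_total=30, min_each_way=5, max_recipients=10):
    """Return the set of undirected edges implied by an email log.

    messages: iterable of (sender, recipients) pairs, recipients being a list.
    An edge joins two people who exchanged at least `min_total` emails in
    total and at least `min_each_way` in each direction.
    """
    sent = Counter()
    for sender, recipients in messages:
        if len(recipients) > max_recipients:
            continue                       # treated as an announcement, skipped
        for recipient in recipients:
            if recipient != sender:
                sent[(sender, recipient)] += 1

    edges = set()
    for (a, b), n_ab in sent.items():
        n_ba = sent.get((b, a), 0)
        if n_ab + n_ba >= min_total and min(n_ab, n_ba) >= min_each_way:
            edges.add(frozenset((a, b)))
    return edges

# Hypothetical log: only alice and bob clear both thresholds.
messages = (
    [("alice", ["bob"])] * 25 + [("bob", ["alice"])] * 5
    + [("alice", ["carol"])] * 40 + [("carol", ["alice"])] * 2
    + [("dave", [f"user{i}" for i in range(11)])] * 100
)
email_edges = build_email_graph(messages)
print(email_edges)
```

The alice-carol pair fails the two-way requirement and dave's messages are dropped as announcements, which mirrors how one-way and broadcast traffic is excluded from the graph.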

There was one giant connected component of 343 nodes and six smaller 
components ranging in size from 2 to 8. The modified Brandes algorithm de- 
tected 60 additional distinct communities within the giant component. The 
largest community consisted of 57 individuals, and there were several commu- 
nities of size 2. The mean community size was 8.4, with standard deviation 
5.3. A comparison of these communities with information from the HP corpo- 
rate directory revealed that 49 of the 66 communities consisted of individuals 
entirely within one lab or organizational unit. The remaining 17 contained 
individuals from two or more organizations within the company. 

2.4 Identifying Leadership Roles

In addition to identifying formal and informal work communities, it is also pos- 
sible to draw inferences about the leadership of an organization from its com- 
munication data. One method is to visualize the above graph of the HP Labs 
email network with a standard force-directed spring algorithm [19], shown in 
Figure 4. This spring layout of the email network does not use any informa- 
tion about the actual organization structure, and yet high level managers (the 
reddest nodes are at the top of the hierarchy) are placed close to the center of 
the graph. The trend is quantified in Table 1, which lists the average hierarchy 
depth (levels from the lab director) as a function of the position in the layout 
from the center. 

Note that there is a group of 6 nodes in the upper right portion of the 
graph that are quite removed from the center, but are relatively high in the 
organizational hierarchy. This is the university relations group that reports 
directly to the head of HP Labs, but has no other groups reporting to it. Hence 
the layout algorithm correctly places them on the periphery of the graph, since 
their function, that of managing HP's relationship with universities, while 
important, is not at the core of day-to-day activities of the labs. 



Fig. 4. The giant connected component of the HP Labs email network. The redness 
of a vertex indicates an individual's closeness to the top of the lab hierarchy (red- 
close to top, blue-far from top, black-no data available). 

distance from center   number of vertices   average depth in hierarchy
< 0.1                  14                   2.6
0.1 to 0.2             32                   3.0
0.2 to 0.3             56                   3.2
0.3 to 0.4             66                   4.0
0.4 to 0.5             56                   4.0
0.5 to 0.6             45                   4.2
0.6 to 0.7             42                   4.0
0.7 to 0.8             12                   3.9
0.8 to 0.9             13                   3.8

Table 1. Average hierarchy depth by distance from center in layout



Evaluating communication networks with this technique could provide information
about leadership in communities about which little is known. Sparrow
proposed this approach for analyzing criminal networks [43], noting that
"Euclidean Centrality is probably the closest to the reality" of the current 
criminal network analysis techniques. More recently, Krebs applied centrality 
measures and graphing techniques [28] to the terrorist networks uncovered in 
the 9/11 aftermath. He found that the average shortest path was unusually 
long for such a small network, and concluded that the operation had traded 
efficiency for secrecy - individuals in one part of the network did not know 
those in other parts of the network. If one cell had been compromised, the 
rest of the network would remain relatively unaffected. Several social network 
centrality measures pointed to Mohamed Atta's leadership role in the attacks 
of Sept. 11. The role was also confirmed by Osama bin Laden in a video tape 
following the attacks. 

2.5 Field Evaluation 

The HP Labs social network, being much less covert, could readily be com- 
pared to the structure of the formal organization. Nevertheless, the informal 
communities identified by the algorithm could not be verified in this way. 
Tyler et al. decided to validate the results of their algorithm by conducting a 
brief, informal field study. Sixteen individuals chosen from seven of the sixty 
communities identified were interviewed informally. The communities chosen 
represented various community sizes and levels of departmental homogeneity. 
They ranged in size from four to twelve people, and three out of the seven 
were heterogeneous (included members of at least two different departmental 
units within the company). 

All sixteen subjects gave positive affirmation that the community reflected
reality. More specifically, eleven described the group as reflecting their department,
four described it as a specific project group, and one said it was
a discussion group on a particular topic. Nine of the sixteen (56.25%) said
nobody was missing from the group, six people (37.5%) said one person was
missing, and one person (6.25%) said two people were missing. Conversely,
ten of the sixteen (62.5%) said that everybody in the group deserved to be
there, whereas the remaining six (37.5%) said that one person in the group
was misclassified.

The interviews confirmed that most of the communities identified were 
based on organization structure. However, the communities also tended to 
include people who were de facto department members, but who did not 
technically appear in the department's organization chart, such as interns or 
people whose directory information had changed during the two months of 
the study. Finally, the algorithm seemed to succeed in dividing departmental 
groups whose work is distinct, but lumped together groups whose projects 
overlap. 

Heterogeneous, cross-department communities are of particular interest 
because they cannot be deduced from the formal organization. The inter- 
views revealed that most of them represented groups formed around specific 
projects, and in one case, a discussion forum. For example, one community 
contained three people from different labs coordinating on one project: a tech- 
nology transfer project manager, a researcher who was the original designer 
of a piece of PC hardware, and an engineer redesigning the hardware for a 
specific printer. 

2.6 Discussion 

The power of this method for identifying communities and leadership is in 
its automation. It does an effective job of uncovering communities of practice 
with nothing more than email log ("to:" and "from:") data. Its simplicity 
means that it can be applied to organizations of thousands and produce re- 
sults efficiently. However, it is important for computing centrality measures 
to be able to define membership in an organization as well as disambiguate 
identities. In a setting like a corporate lab, the organization is clearly defined 
and identities can be clarified from official directories. In an informal network, 
however, these tasks are much more difficult. 

Communities identified in this automated way lack the richness in contex- 
tual description provided by ethnographic approaches. They do not reveal the 
nature or character of the identified communities, the relative importance of 
one community to another, or the subtle inter-personal dynamics within the 
communities. These kinds of details can only be uncovered with much more 
data- or labor-intensive techniques. However, in cases where an organization is 
very large, widely dispersed, or incompletely defined (informal), this method 
provides a suitable alternative or complement to the more traditional,
labor-intensive approaches.

3 Information Flow in Social Groups

In the previous section we saw that individuals tend to organize both formally 
and informally into groups based on their common activities and interests. In 
this section we examine how this structure in the interaction network affects
the way information spreads. This is not unlike the transmission of an infec- 
tious agent among individuals, where the pattern of contacts determines how 
far a disease spreads. Thus one would expect that epidemic models on graphs 
are relevant to the study of information flow in organizations. In particular, 
recent work on epidemic propagation on scale-free networks found that the
threshold for an epidemic is zero, implying that a finite fraction of the graph 
becomes infected for arbitrarily low transmission probabilities [11, 39, 34]. 
The presence of additional network structure was found to further influence 
the spread of disease on scale-free graphs [16, 47, 33]. 

There are, however, differences between information flows and the spread 
of viruses. While viruses tend to be indiscriminate, infecting any susceptible 
individual, information is selective and passed by its host only to individuals 
the host thinks would be interested in it. The information any individual is in- 
terested in depends strongly on their characteristics. Furthermore, individuals 
with similar characteristics tend to associate with one another, a phenomenon 
known as homophily [30, 44, 17]. Conversely, individuals many steps removed 
in a social network on average tend not to have as much in common, as shown 
in a study [2] of a network of Stanford student homepages and illustrated in 
Figure 5. 

Wu et al. [54] introduced an epidemic model with decay in the transmission
probability of a particular piece of information as a function of the distance
between the originating source and the current potential target. This epidemic 
model on a scale-free network has a finite threshold, implying that the spread 
of information is limited. The predictions were further tested by observing the 
prevalence of messages in an organization and also by numerical experiments 
that take into consideration the organizational distance among individuals. 

Consider the problem of information transmission in a power-law network
whose degree distribution is given by

$$p_k = C k^{-\alpha} e^{-k/\kappa}, \tag{1}$$

where $\alpha > 1$, there is an exponential cutoff at $\kappa$ and $C$ is determined by the
normalization condition. A real world graph will at the very least have a cutoff
at the maximum degree $k = N$, where $N$ is the number of nodes, and many
networks show a cutoff at values much smaller than $N$. For the analysis of the
spread of information flow on networks, Wu et al. used generating functions,
whose application to graphs with arbitrary degree distributions is discussed
in [35]. For a power-law network the generating function is given by

$$G_0(x) = \sum_{k=1}^{\infty} p_k x^k = \frac{\mathrm{Li}_\alpha(x e^{-1/\kappa})}{\mathrm{Li}_\alpha(e^{-1/\kappa})}, \tag{2}$$

where $\mathrm{Li}_n(x)$ is the $n$th polylogarithm of $x$.

Fig. 5. Average similarity of Stanford student homepages as a function of the
number of hyperlinks separating them.

Following the analysis in [37] for the SIR (susceptible, infected, removed)
model, one can estimate the probability $p_m^{(1)}$ that the first person in the community
who has received a piece of information will transmit it to $m$ of their
neighbors. Using the binomial distribution, we find

$$p_m^{(1)} = \sum_{k=m}^{\infty} p_k \binom{k}{m} T^m (1-T)^{k-m}, \tag{3}$$

where the superscript "(1)" refers to first neighbors, those who received the
information directly from the initial source. The transmissibility $T$ is the average
total probability that an infective individual will transmit an item to a
susceptible neighbor and is derived in [37] as a function of $r_{ij}$, the rate of contacts
between two nodes, and $\tau_i$, the time a node remains infective. If $r_{ij}$ and
$\tau_i$ are iid randomly distributed according to the distributions $P(r)$ and $P(\tau)$,
then the item will propagate as if all transmission probabilities are equal to a
constant $T$.

The generating function for $p_m^{(1)}$ is given by

$$G^{(1)}(x) = \sum_{m=0}^{\infty} \sum_{k=m}^{\infty} p_k \binom{k}{m} T^m (1-T)^{k-m} x^m \tag{5}$$

$$= G_0(1 + (x-1)T) = G_0(x; T). \tag{6}$$
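The step from the double sum to $G_0(1 + (x-1)T)$ is not spelled out in the text. One way to see it, offered here as a worked sketch, is to swap the order of summation and apply the binomial theorem:

```latex
\begin{align*}
G^{(1)}(x)
  &= \sum_{m=0}^{\infty} \sum_{k=m}^{\infty} p_k \binom{k}{m} T^m (1-T)^{k-m} x^m \\
  &= \sum_{k} p_k \sum_{m=0}^{k} \binom{k}{m} (xT)^m (1-T)^{k-m}
     && \text{(swap the order of summation)} \\
  &= \sum_{k} p_k \,\bigl(1 - T + xT\bigr)^k
     && \text{(binomial theorem)} \\
  &= G_0\bigl(1 + (x-1)T\bigr).
\end{align*}
```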

Suppose the transmissibility decays as a power of the distance from the
initial source. The probability that an $m$th neighbor will transmit the information
to a person with whom he has contact is given by

$$T^{(m)} = (m+1)^{-\beta} T, \tag{7}$$

where $\beta > 0$ is the decay constant. $T^{(m)} = T$ at the originating node ($m = 0$)
and decays to zero as $m \to \infty$. Power-law decay is the weakest form of decay
and the results obtained from it will also be valid for stronger functional forms
such as an exponential decay.

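The generating function for the transmission probability to 2nd neighbors
can be written as

$$G^{(2)}(x) = \sum_k p_k^{(1)} \left[ G_1^{(1)}(x) \right]^k = G^{(1)}\left( G_1^{(1)}(x) \right), \tag{8}$$

where

$$G_1^{(1)}(x) = G_1(x; 2^{-\beta} T) = G_1(1 + (x-1) 2^{-\beta} T) \tag{9}$$

and

$$G_1(x) = \frac{\sum_k k p_k x^{k-1}}{\sum_k k p_k} = \frac{G_0'(x)}{G_0'(1)} \tag{10}$$

is the generating function of the degree distribution of a vertex reached by
following a randomly chosen edge, not counting the edge itself [35]. Similarly,
if we define $G^{(m)}(x)$ to be the generating function for the number of $m$th
neighbors affected, then we have

$$G^{(m+1)}(x) = G^{(m)}\left( G_1^{(m)}(x) \right) \quad \text{for } m \geq 1, \tag{11}$$

where

$$G_1^{(m)}(x) = G_1(x; (m+1)^{-\beta} T) = G_1(1 + (x-1)(m+1)^{-\beta} T). \tag{12}$$

Or, more explicitly,

$$G^{(m+1)}(x) = G^{(1)}\left( G_1^{(1)}\left( G_1^{(2)}\left( \cdots G_1^{(m)}(x) \right) \right) \right). \tag{13}$$

The average number $z_{m+1}$ of $(m+1)$th neighbors is

$$z_{m+1} = G^{(m+1)\prime}(1) = G_1^{(m)\prime}(1)\, G^{(m)\prime}(1) = G_1^{(m)\prime}(1)\, z_m. \tag{14}$$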

The condition that the size of the outbreak remains finite is that at some
distance $m + 1$, fewer individuals will be infected than at distance $m$, i.e. the
spread of the infection is halting. This can be expressed as

$$\frac{z_{m+1}}{z_m} = G_1^{(m)\prime}(1) < 1, \tag{15}$$

or

$$(m+1)^{-\beta}\, T\, G_1'(1) < 1. \tag{16}$$

Note that $G_1'(1)$ does not diverge when $\alpha < 3$ due to the presence of a cutoff at
$\kappa$. For any decaying $T$, the left hand side of the inequality above goes to zero
when $m \to \infty$, so the condition is eventually satisfied for large $m$. Therefore
the average total size

$$\langle s \rangle = \sum_{m=1}^{\infty} z_m \tag{17}$$

is always finite if the transmissibility decays with distance.
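To make the finiteness claim concrete, the recursion for $z_m$ can be unrolled. The shorthand $c$ is introduced here for convenience and is not part of the original notation; the closed form for $\beta = 1$ is a worked special case rather than a result quoted from [54].

```latex
\begin{align*}
z_1 &= G^{(1)\prime}(1) = T\,G_0'(1), \qquad c \equiv T\,G_1'(1), \\
z_{m+1} &= (m+1)^{-\beta}\, c\, z_m
  \;\Longrightarrow\; z_m = z_1\,\frac{c^{\,m-1}}{(m!)^{\beta}}, \\
\langle s \rangle &= z_1 \sum_{m=1}^{\infty} \frac{c^{\,m-1}}{(m!)^{\beta}} < \infty
  \quad \text{for any } \beta > 0 \text{ (ratio test: } c/(m+1)^{\beta} \to 0\text{)}, \\
\beta = 1:\quad \langle s \rangle &= z_1\,\frac{e^{c} - 1}{c}.
\end{align*}
```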

Fig. 6. $T_c$ as a function of $\alpha$. The three different curves, from bottom to top, are: 1)
no decay in transmission probability, no exponential cutoff in the degree distribution
($\kappa = \infty$, $\beta = 0$), 2) $\kappa = 100$, $\beta = 0$, 3) $\kappa = 100$, $\beta = 1$.



Wu et al. compared their model with previous results [39] on disease spread
on scale-free networks, by considering a network made up of $10^6$ vertices. An
epidemic was defined to be an outbreak affecting more than 1% or $10^4$ vertices.
Thus for fixed $\alpha$, $\kappa$ and $\beta$, $T_c$ is the critical transmissibility above which $\langle s \rangle$
would be made to exceed $10^4$.
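The definition of $T_c$ above can be turned into a small numerical experiment. The sketch below is not the authors' code: it assumes the mean outbreak size is the sum of the $z_m$, with $z_1 = T\,G_0'(1)$ and $z_{m+1} = (m+1)^{-\beta}\, T\, G_1'(1)\, z_m$, truncates the degree distribution at `k_max`, and finds $T_c$ by bisection. All function names and numerical cutoffs are choices made for illustration.

```python
import math

def degree_moments(alpha, kappa, k_max=20000):
    """Return (G0'(1), G1'(1)) for p_k proportional to k**-alpha * exp(-k/kappa)."""
    weights = [k ** -alpha * math.exp(-k / kappa) for k in range(1, k_max + 1)]
    norm = sum(weights)
    mean_k = sum(k * w for k, w in enumerate(weights, start=1)) / norm
    mean_kk1 = sum(k * (k - 1) * w for k, w in enumerate(weights, start=1)) / norm
    return mean_k, mean_kk1 / mean_k      # G1'(1) = G0''(1) / G0'(1)

def mean_outbreak_size(T, g0p, g1p, beta, cap=1e12, max_m=10000):
    """Sum z_m over m, stopping once terms are negligible or the sum is huge."""
    z = T * g0p                           # z_1
    total = 0.0
    for m in range(1, max_m + 1):
        total += z
        if total > cap or z < 1e-15:
            break
        z *= (m + 1) ** -beta * T * g1p   # z_{m+1}
    return total

def critical_transmissibility(alpha, kappa, beta, epidemic=1e4):
    """Smallest T (by bisection) at which the mean outbreak exceeds `epidemic`."""
    g0p, g1p = degree_moments(alpha, kappa)
    if mean_outbreak_size(1.0, g0p, g1p, beta) <= epidemic:
        return 1.0                        # no epidemic even at T = 1
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if mean_outbreak_size(mid, g0p, g1p, beta) > epidemic:
            hi = mid
        else:
            lo = mid
    return hi

tc_no_decay = critical_transmissibility(alpha=2.0, kappa=100.0, beta=0.0)
tc_decay = critical_transmissibility(alpha=2.0, kappa=100.0, beta=1.0)
print(round(tc_no_decay, 3), round(tc_decay, 3))
```

Under these assumptions the threshold without decay stays close to zero, while switching on the decay with $\beta = 1$ pushes it to roughly one half, in line with the behaviour described for Figure 6.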

The numerical result of $T_c$ as a function of $\alpha$ is shown in Figure 6. When
$\beta = 0$ (there is no decay in transmission probability), $\kappa = \infty$ (there is no cutoff
in the degree distribution), and $\alpha < 3$, $T_c$ is zero and epidemics encompassing
more than $10^4$ vertices occur for arbitrarily small $T$, as was found in [39].
Keeping $\beta$ at zero and adding a cutoff at $\kappa = 100$ produces a non-zero critical
transmissibility $T_c$, as was found in [37]. For $\alpha = 2$, a typical value for real-world
networks, $T_c$ is still very near zero, meaning that for most values of $T$,
epidemics do occur. However, when we impose a decay in transmissibility by
setting $\beta$ to 1, $T_c$ rises substantially. For example, $T_c$ jumps to 0.54 at $\alpha = 2$
and rises rapidly to 1 as $\alpha$ increases further, implying that the information
may not spread over the network.

Fig. 7. Number of people receiving URLs and attachments (axis label: number of recipients).

In order to validate empirically that the spread of information within a 
network of people is limited, and hence distinct from the spread of a virus, a 
sample from the mail clients of 40 individuals (30 within HP Labs, and 10 from 
other areas of HP, other research labs, and universities) was gathered. Each 
volunteer executed a program that identified URLs and attachments in the 
messages in their mailboxes, as well as the time the messages were received. 
This data was cryptographically hashed to protect the privacy of the users. 
By analyzing the message content and headers, the data was restricted to 
include only messages which had been forwarded at least one time, thereby 
eliminating most postings to mailing lists and more closely approximating 
true inter-personal information spreading behavior. The median number of 
messages in a mailbox in the sample was 2200, indicating that many users keep 
a substantial portion of their email correspondence. Although some messages 
may have been lost when users deleted them, it was assumed that a majority 
of messages containing useful information had been retained. 

Figure 7 shows a histogram of how many users had received each of the
3401 attachments and 6370 URLs. The distribution shows that only a small
fraction (5% of attachments and 10% of URLs) reached more than 1 recipient.
Very few (41 URLs and 6 attachments) reached more than 5 individuals, a
number which, in a sample of 40, starts to resemble an outbreak. In follow-up
discussions with the study subjects, the content and significance of most of
these messages were identified. Fourteen of the URLs were advertisements attached
to the bottom of an email by free email services such as Yahoo and MSN.
These are in a sense viral, because the sender is sending them involuntarily.
It is this viral strategy that was responsible for the rapid buildup of the
Hotmail free email service user base. Ten URLs pointed to internal HP project
or personal pages, 3 URLs were for external commercial or personal sites, and
the remaining 14 could not be identified.

Fig. 8. Outdegree distribution for all senders (224,514 in total) sending email to
or from the HP Labs email server over the course of 3 months. The outdegree of a
node is the number of correspondents the node sent email to.

The next portion of the analysis examined the effect of decay in the 
transmission probability on the email graph at HP Labs. The graph was 
constructed from recorded logs of all incoming and outgoing messages over a 
period of 3 months. The graph has a nearly power-law outdegree distribution, 
shown in Figure 8, including both internal and external nodes. Because all of 
the outgoing and incoming contacts were recorded for internal nodes, their 
indegrees and outdegrees were higher than for the external nodes, for which 
we could only record the email they sent to and received from HP Labs. A graph 
with the internal and external nodes mixed (as in [14]) was used to specifically 
demonstrate the effect of a decay on the spread of email in a power-law graph. 

The spread of a piece of information was simulated by selecting a random 
initial sender to infect and following the email log containing 120,000 entries 
involving over 7,000 recipients in the course of a week. Every time an infec- 
tive individual (one willing to transmit a particular piece of information) was 
recorded as sending an email to someone else, they had a constant probability 
p of infecting the recipient. Hence individuals who email more often have a 
higher probability of infecting. It is also assumed that an individual remains 
infective for a period of 24 hours. 

Fig. 9. Average outbreak and epidemic size as a function of the transmission 
probability p. The total number of potential recipients is 7119. 
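
The replay of the email log described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the log is taken to be a time-ordered list of `(hours, sender, recipient)` tuples, the seed is assumed to become infective at the time of the first log entry, and the function name `simulate_spread` is hypothetical.

```python
import random

def simulate_spread(log, seed, p, infective_hours=24.0, rng=None):
    """Replay a time-ordered email log and return the set of people reached.

    log: list of (time_in_hours, sender, recipient), sorted by time.
    Each email sent by an infective sender infects the recipient with
    constant probability p; a sender stays infective for infective_hours.
    """
    rng = rng or random.Random(0)
    if not log:
        return {seed}
    infected_at = {seed: log[0][0]}  # assumption: seed is infected at log start
    for t, sender, recipient in log:
        t_inf = infected_at.get(sender)
        if t_inf is None or t - t_inf > infective_hours:
            continue  # sender was never infected, or is no longer infective
        if recipient not in infected_at and rng.random() < p:
            infected_at[recipient] = t
    return set(infected_at)

# Toy log (hypothetical): times are in hours.
log = [(0, "a", "b"), (1, "b", "c"), (30, "a", "d"), (40, "c", "e")]
reached = simulate_spread(log, "a", p=1.0)
```

Because every sent email is an independent chance to infect, individuals who email more often are more likely to pass the item on, as noted in the text. Averaging the size of `reached` over many random seeds and values of p gives curves of the kind shown in Fig. 9.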

Next a decay was introduced in the one-time transmission probability $p_{ij}$ 
as $p\,d_{ij}^{-1.75}$, where $d_{ij}$ is the distance in the organizational 
hierarchy between individuals i and j. The exponent roughly corresponds to the 
decay in similarity between homepages shown in Figure 5. Here 
$v_{ij} = p_{ij} f_{ij}$, where $f_{ij}$ is the frequency of communication 
between the two individuals, obtained from the email logs. The decay represents 
the fact that individuals closer together in the organizational hierarchy share 
more common interests. Individuals have a distance of one to their immediate 
superiors and subordinates and to those they share a superior with. The distance 
between someone within HP Labs and someone outside of HP Labs was set to the 
maximum hierarchical distance of 8. 
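
To give a feel for how quickly such a decay suppresses transmission, the following sketch evaluates a power-law decay in hierarchical distance. The default exponent of 1.75 and the base probability of 0.5 are assumed values chosen for illustration, and the function name is hypothetical.

```python
MAX_H_DISTANCE = 8  # distance assigned to anyone outside HP Labs, per the text

def transmission_probability(p, h_distance, exponent=1.75):
    """One-time transmission probability after a power-law decay in distance.

    The exponent is an assumed illustrative value of the decay parameter.
    """
    return p * h_distance ** (-exponent)

# Probability of passing an item on at increasing hierarchical distance.
for d in (1, 2, 4, MAX_H_DISTANCE):
    print(d, round(transmission_probability(0.5, d), 4))
```

With such a decay, contacts at the maximum distance are far less likely to receive the item than immediate colleagues, which is why the outbreak stays confined even for large p.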

Figure 9 shows the variation in the average outbreak size, and the average 
epidemic size (chosen to be any outbreak affecting more than 30 individuals). 
Without decay, the epidemic threshold falls below p = 0.01. With decay, the 
threshold is set back to p = 0.20 and the outbreak epidemic size is limited to 
about 50 individuals, even for p = 1. 

As these results show, the decay of similarity among members of a social 
group has strong implications for the propagation of information among them. 
In particular, the number of individuals that a given email message reaches 
is very small, in contrast to what one would expect on the basis of a virus 
epidemic model on a scale free graph. The implication of this finding is that 
merely discovering hubs in a community network is not enough to ensure that 
information originating at a particular node will reach a large fraction of the 
community. 


4 Small World Search 

In the preceding section we discussed how the tendency of like individuals 
to associate with one another can affect the flow of information within an 
organization. In this section we will show how one can take advantage of the 
very same network structure to navigate social ties and locate individuals. 

The observation that any two people in the world are most likely linked 
by a short chain of acquaintances, known as the "small world" phenomenon, 
has been the focus of much research over the last forty years [32, 45, 31, 25]. 
In the 1960s and 70s, participants in small world experiments successfully 
found paths from Nebraska to Boston and from Los Angeles to New York. In 
an experiment in 2001 and 2002, 60,000 individuals were able to repeat the 
experiment using email to form chains with just four links on average across 
different continents [12]. The small world phenomenon is currently exploited by 
commercial networking services such as LinkedIn, Friendster, and Spoke^1 to 
help people network, for both business and social purposes. 

The existence of short paths is not particularly surprising in and of itself. 
Although many social ties are "local", meaning that they are formed through 
one's work or place of residence, Watts and Strogatz [50] showed that it takes 
only a few "random" links between people of different professions or locations 
to create short paths in a social network and make the world "small". In 
addition, Pool and Kochen [40] have estimated that an average person has between 
500 and 1,500 acquaintances. Ignoring for the moment overlap in one's circle of 
friends, one would have 1,000^2 or 1,000,000 friends of friends, and 1,000^3 or 
one billion friends-of-friends-of-friends. This means that it would take only 2 
intermediaries to reach a number of people on the order of the population of 
the entire United States. 
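
The back-of-the-envelope estimate above can be checked with a few lines of code. This is a sketch under the same simplifying assumption as the text (no overlap between circles of friends); the population figure of 300 million is a round number assumed here for illustration.

```python
def intermediaries_needed(acquaintances, population):
    """Fewest intermediaries such that the number of people reachable,
    assuming no overlap between circles of friends, covers the population."""
    reach, links = acquaintances, 1
    while reach < population:
        reach *= acquaintances  # each extra link multiplies the reach
        links += 1
    return links - 1  # a chain of n links passes through n - 1 intermediaries

US_POPULATION = 300_000_000  # assumed round figure
needed = intermediaries_needed(1_000, US_POPULATION)
```

With 1,000 acquaintances per person, `needed` evaluates to 2, in agreement with the estimate in the text. In reality, overlap between circles of friends makes the true reach considerably smaller at each step.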

Although the existence of short paths is not surprising, it is another 
question altogether how people are able to select among hundreds of 
acquaintances the correct person to form the next link in the chain. Killworth 
and Bernard [25] performed the "reverse" experiment to measure how many 
acquaintances a typical person would use as a first step in a small world 
experiment. Presented with 1,267 random targets, the subjects chose about 210 
different acquaintances on average, based overwhelmingly on geographic 
proximity and similarity of profession to the targets. 

1 http://www.linkedin.com/, http://www.friendster.com, 
http://www.spokesoftware.com 

Bernardo A. Huberman and Lada A. Adamic 

Fig. 10. Degree distribution in the HP Labs email network (horizontal axis: 
number of email correspondents, k). Two individuals are linked if they exchanged 
at least 6 emails in either direction. The inset shows the same distribution, 
but on a semilog scale, to illustrate the exponential tail of the distribution. 

Recently, mathematical models have been proposed to explain why people 
are able to find short paths. The model of Watts, Dodds, and Newman [49] 
assumes that individuals belong to groups that are embedded hierarchically 
into larger groups. For example, an individual might belong to a research lab 
that is part of an academic department at a university, that is in a school 
consisting of several departments, that is part of a university, that is one of 
the academic institutions in the same country, etc. The probability that two 
individuals have a social tie to one another is proportional to 
$\exp(-\alpha h)$, where h is the height of their lowest common branching 
point in the hierarchy. 
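
The exponential decay of linking probability with hierarchical separation can be illustrated with a short sketch. The representation of each individual as a path of nested groups, the convention that two members of the same lowest-level group have height 0, and the value of the decay parameter are all assumptions made here for illustration; the names are hypothetical.

```python
import math

def branching_height(path_a, path_b):
    """Levels one must climb from the lowest group until both paths
    share a common ancestor (0 if both are in the same lowest group)."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return max(len(path_a), len(path_b)) - common

def link_weight(path_a, path_b, alpha):
    """Unnormalised linking probability, decaying exponentially with height."""
    return math.exp(-alpha * branching_height(path_a, path_b))

# Toy hierarchy (hypothetical): university > school > department > lab.
lab1 = ("univ", "school", "dept1", "lab1")
lab2 = ("univ", "school", "dept1", "lab2")
other_dept = ("univ", "school", "dept2", "lab3")

same_dept = link_weight(lab1, lab2, alpha=1.0)
cross_dept = link_weight(lab1, other_dept, alpha=1.0)
```

Here `same_dept` exceeds `cross_dept` by a factor of e, since the second pair must climb one level higher to find a common group.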

The decay in linking probability means that two people in the same 
research laboratory are more likely to know one another than two people who 
are in different departments at a university. The model assumes a number of 
separate hierarchies corresponding to characteristics such as geographic 
location or profession. In reality, the hierarchies may be intertwined, for 
example professors at a university living within a short distance of the 
university campus, but for simplicity, the model treats them separately. 

In numerical experiments, artificial social networks were constructed and 
a simple greedy algorithm was run in which the next step in the chain 
was selected to be the neighbor of the current node with the smallest distance 
to the target along any dimension. At each step in the chain there is a fixed 
probability, called the attrition rate, that the node will not pass the message 
further. The numerical results showed that for a range of the parameter α and 
number of attribute dimensions, the networks are "searchable", meaning that a 
minimum fraction of search paths find their target. 

Kleinberg [26, 27] posed a related question: in the absence of attrition, 
when does the length of the chains scale in the same way as the average 
shortest path? Unlike in the study of Watts et al., there is no attrition: all 
chains run until completion, but need to scale as the actual shortest path in 
the network does. In the case of a small world network, the average shortest 
path scales as $\ln(N)$, where N is the number of nodes. Kleinberg proved 
that a simple greedy strategy based on geography could achieve chain lengths 
bounded by $(\ln N)^2$ under the following conditions: nodes are situated on an 
m-dimensional lattice with connections to their 2m closest neighbors, and 
additional connections are placed between any two nodes with probability 
$p \sim r^{-m}$, where r is the distance between them. Since in the real world 
our locations are specified primarily by two dimensions, longitude and latitude, 
the probability is inversely proportional to the square of the distance. A 
person should be four times as likely to know someone living one block away as 
someone living two city blocks away. However, Kleinberg also proved that if the 
probabilities of acquaintance do not follow this relationship, nodes would not 
be able to use a simple greedy strategy to find the target in polylogarithmic 
time. 
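
The inverse-square rule for the two-dimensional case can be illustrated numerically. The sketch below normalizes long-range link probabilities on a small square lattice; the 5 x 5 grid, the use of lattice (Manhattan) distance, and the function name are assumptions made for this example.

```python
def long_range_probabilities(source, nodes, m=2):
    """Normalised probability of a long-range link from source to each
    other node, with weight proportional to r**(-m) for lattice distance r."""
    weights = {}
    for node in nodes:
        if node == source:
            continue
        r = abs(node[0] - source[0]) + abs(node[1] - source[1])
        weights[node] = r ** (-m)
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}

# Toy lattice (hypothetical): a 5 x 5 grid with the source at its centre.
grid = [(x, y) for x in range(5) for y in range(5)]
probs = long_range_probabilities((2, 2), grid)
ratio = probs[(2, 3)] / probs[(2, 4)]  # one block away vs. two blocks away
```

The ratio of the two probabilities is 4 regardless of the grid size, because the normalization constant cancels.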

The models of both Watts et al. and Kleinberg show that the probability 
of acquaintance needs to be related to the proximity between individuals' 
attributes in order for simple search strategies using only local information 
to be effective. Below we describe experiments empirically testing the 
assumptions and predictions of the two proposed models. 

4.1 Method 

In order to test the above hypothesis, Adamic and Adar [1] applied search 
algorithms to email networks derived from the email logs at HP Labs already 
described in section 2. A social contact was defined to be someone with whom 
an individual had exchanged at least 6 emails each way over the period of 
approximately 3 months. The bidirectionality of the email correspondence 
guaranteed that a conversation had gone on between the two individuals and 
hence that they are familiar with one another. 

Imposing this constraint yielded a network of 436 individuals with a me- 
dian number of 10 acquaintances and a mean of 13. The degree distribution, 
shown in Figure 10, is highly skewed with an exponential tail. This is in con- 
trast to the raw power-law email degree distribution, used in section 3 and 
shown in Figure 8, pertaining to both internal and external nodes and pos- 
sessing no threshold in email volume. A scale free distribution in the raw 
network arises because there are many external nodes emailing just one indi- 
vidual inside the organization, and there are also some individuals inside the 
organization sending out announcements to many people and hence having a 
very high degree. However, once we impose a higher cost for maintaining a 
social contact (that is, emailing that contact at least six times and receiving 
at least as many replies), then there are few individuals with many contacts. 
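
The thresholding step that turns raw email logs into a network of social contacts can be sketched as follows. This is an illustration only: the log is assumed to be a list of `(sender, recipient)` pairs, one per message, and the function name and toy data are hypothetical.

```python
from collections import Counter

def contact_network(messages, threshold=6):
    """Return the undirected contact pairs that exchanged at least
    `threshold` emails in each direction."""
    sent = Counter(messages)  # (sender, recipient) -> number of emails
    edges = set()
    for (a, b), n_ab in sent.items():
        # Counter returns 0 for pairs that never occur, so one-way
        # correspondence is filtered out here.
        if a != b and n_ab >= threshold and sent[(b, a)] >= threshold:
            edges.add(frozenset((a, b)))
    return edges

# Toy log (hypothetical): ann and bob correspond regularly in both
# directions; ann mostly broadcasts to cat.
messages = (
    [("ann", "bob")] * 6
    + [("bob", "ann")] * 7
    + [("ann", "cat")] * 10
    + [("cat", "ann")] * 2
)
edges = contact_network(messages)
```

Only the ann and bob pair survives the filter, which is how one-way announcements and occasional external correspondents drop out of the network.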

4.2 Simulating Milgram's experiment on an email network 

The resulting network, consisting of regular email patterns between HP Labs 
employees, had 3.1 edges separating any two individuals on average, and a 
median of 3. Simulations were performed on the network to determine whether 
members of the network would be able to use a simple greedy algorithm to 
locate a target. In this simple algorithm, each individual can use knowledge 
only of their own email contacts, but not their contacts' contacts, to forward 
the message. 

Fig. 11. Email communications within HP Labs (gray lines) mapped onto the 
organizational hierarchy (black lines). Note that email communication tends to 
"cling" to the formal organizational chart. 

Three different strategies were tested, at each step passing the message to 
the contact who is either 

• best connected 

• closest to the target in the organizational hierarchy 

• sitting in closest physical proximity to the target 
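
All three strategies share the same greedy skeleton and differ only in how a contact is scored. The sketch below is one possible reading of that procedure, not the code used in the study: the rule that a message is never returned to someone who has already held it, the `max_steps` cutoff, and all names are assumptions.

```python
def greedy_search(contacts, source, target, score, max_steps=100):
    """Forward a message greedily using only local knowledge.

    contacts: dict mapping each person to the set of their direct contacts.
    score(candidate, target): lower means a more promising next holder.
    Returns the path taken, or None if the search gets stuck or is too long.
    """
    path, visited, current = [source], {source}, source
    while current != target:
        if len(path) - 1 >= max_steps:
            return None
        neighbours = contacts[current]
        if target in neighbours:
            nxt = target  # deliver directly when the target is a contact
        else:
            unvisited = [n for n in neighbours if n not in visited]
            if not unvisited:
                return None  # dead end
            nxt = min(unvisited, key=lambda n: score(n, target))
        path.append(nxt)
        visited.add(nxt)
        current = nxt
    return path

# Toy network (hypothetical).
contacts = {
    "a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"},
    "d": {"b", "e"}, "e": {"d"},
}

def by_degree(candidate, target):
    """High-degree strategy: prefer the best connected contact."""
    return -len(contacts[candidate])

path = greedy_search(contacts, "a", "e", by_degree)
```

The hierarchy and geography strategies would be obtained by passing a score function that returns the candidate's hierarchical or physical distance to the target instead.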

The first strategy selects the individual who is more likely to know the 
target by virtue of the fact that he/she knows so many people. It has been 
shown [3] that this is an effective strategy in power-law networks with 
exponents close to 2 (the case of the unfiltered HP Labs email network), but 
that it performs poorly in graphs with a Poisson degree distribution that has 
an exponential tail. Since the distribution of contacts in the filtered HP 
network was not power-law, the high degree strategy was not expected to perform 
well, and this was verified through simulation. The median number of steps 
required to find a randomly chosen target from a random starting point was 
17, compared to the three steps in the average shortest path. Even worse, the 
average number of steps was 40. This discrepancy between the mean and the 
median is a reflection of the skewness of the distribution: a few well connected 
individuals and their contacts are easy to find, but some individuals who do 
not have many links and are not connected to highly connected individuals 
are difficult to locate using this strategy. 

Fig. 12. Example illustrating a search path from source A to target B using 
information about the target's position in the organizational hierarchy to 
direct a message. Numbers in the square give the h-distance from the target. 

The second strategy consisted of passing the message to the contact clos- 
est to the target in the organizational hierarchy. The strategy relies on the 
observation, illustrated in Figures 11 and 13, that individuals closer together 
in the organizational hierarchy are more likely to email with one another. Fig- 
ure 12 illustrates such a search, labelling nodes by their hierarchical distance 
(h-distance) from the target. The h-distance is computed as follows: a node 
has distance one to their manager and to everyone they share a manager with. 
Distances are then recursively assigned, so that each node has h-distance 2 to 
their first neighbor's neighbors, and h-distance 3 to their second neighbor's 
neighbors, etc. A simple greedy strategy using information about the organi- 
zational hierarchy worked extremely well. The median number of steps was 
only 4, close to the median shortest path of 3. With the exception of one in- 
dividual, whose manager was not located on site, and who was consequently 
difficult to locate, the mean number of steps was 4.7, meaning that not only 
are people typically easy to find, but nearly everybody can be found in a 
reasonable number of steps. 
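
The recursive definition of h-distance given above amounts to a shortest-path computation on a graph in which each person is linked to their manager and to everyone sharing that manager. A minimal sketch follows; the representation of the hierarchy as a `manager_of` dictionary and the toy organization are assumptions made for illustration.

```python
from collections import deque

def h_distance(manager_of, a, b):
    """Hierarchical distance between a and b, or None if unconnected.

    manager_of maps each person to their manager (None for the top node).
    Distance 1 links a person to their manager and to peers who share
    that manager; larger distances follow by breadth-first search.
    """
    neighbours, reports = {}, {}
    for person, boss in manager_of.items():
        neighbours.setdefault(person, set())
        if boss is not None:
            neighbours.setdefault(boss, set()).add(person)
            neighbours[person].add(boss)
            reports.setdefault(boss, []).append(person)
    for team in reports.values():
        for member in team:
            neighbours[member].update(q for q in team if q != member)

    dist, queue = {a: 0}, deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return dist[node]
        for n in neighbours.get(node, ()):
            if n not in dist:
                dist[n] = dist[node] + 1
                queue.append(n)
    return None

# Toy organization (hypothetical): two managers reporting to one director.
manager_of = {"dir": None, "m1": "dir", "m2": "dir",
              "x": "m1", "y": "m1", "z": "m2"}
```

In this toy organization, x and y share a manager and are at distance 1, while x and z are at distance 3 (via m1 and m2, who share the director).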

In the original experiment by Milgram the completed chains were divided 
between those that reached the target through his professional contacts and 
those that reached him through his hometown. On average those that relied 
on geography took 1.5 steps longer to reach the target, a difference found 
to be statistically significant. In the words of Travers and Milgram [45], the 
following seemed to occur: "Chains which converge on the target principally 
by using geographic information reach his hometown or the surrounding areas 
readily, but once there often circulate before entering the target's circle of 
acquaintances. There is no available information to narrow the field of 
potential contacts which an individual might have within the town." 

Fig. 13. Probability of linking as a function of the separation in the 
organizational hierarchy. The exponential parameter α = 0.92 is in the 
searchable range according to the model of Watts et al. [49]. 

Performing the small world experiment on the HP email network using 
geography produced a similar result, in that geography could be used to find 
most individuals, but was slower, taking a median number of 7 steps, and a 
mean of 12. Figure 14 shows the email correspondence mapped onto the phys- 
ical layout of the buildings. Individuals' locations are given by their building, 
the floor of the building, and the nearest building post (for example "H15") 
to their cubicle. The distance between two cubicles was approximated by the 
"street" distance between their posts (for example "A3" and "C10" would be 
(C - A) * 25' + (10 - 3) * 25' = 2 * 25' + 7 * 25' = 225 feet apart). Adding 
the x and y directions separately reflects the interior topology of the 
buildings, where one navigates perpendicular hallways and cannot traverse 
diagonally. If individuals are located on different floors or in different 
buildings, the distance between buildings and the length of the stairway are 
factored in. 

Fig. 14. Email communications within HP Labs mapped onto approximate physical 
location based on the nearest post number and building given for each employee. 
Each box represents a different floor in a building. The lines are color coded 
based on the physical distance between the correspondents: red for nearby 
individuals, blue for far away contacts. 
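
The "street" distance between two posts on the same floor can be computed directly from the post labels. The sketch below assumes labels consisting of one capital letter followed by a number and a 25-foot spacing between adjacent posts, as in the worked example in the text; it ignores the corrections for different floors and buildings, and the function names are hypothetical.

```python
import re

POST_SPACING_FEET = 25  # spacing between adjacent posts, from the worked example

def parse_post(label):
    """Split a post label such as 'C10' into (column_index, row_number)."""
    match = re.fullmatch(r"([A-Z])(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognised post label: {label!r}")
    return ord(match.group(1)) - ord("A"), int(match.group(2))

def street_distance_feet(post_a, post_b):
    """Distance along perpendicular hallways between posts on one floor."""
    (xa, ya), (xb, yb) = parse_post(post_a), parse_post(post_b)
    return (abs(xa - xb) + abs(ya - yb)) * POST_SPACING_FEET

distance = street_distance_feet("A3", "C10")  # 2 * 25 + 7 * 25 = 225
```

Summing the two directions separately, rather than taking the straight-line distance, mirrors the constraint that one walks along hallways and cannot cut diagonally across cubicles.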

Figure 16 shows a histogram of chain lengths resulting from searches using 
each of the three strategies. It shows the clear advantage of using the target's 
position in the organizational hierarchy as opposed to his/her cubicle location 
to pass a message through one's email contacts. It also shows that both searches 
using information about the target outperform a search relying solely on the 
connectivity of one's contacts. 

4.3 Discussion 

The above simulated experiments verify the models proposed in [49] and [26] 
to explain why individuals are able to successfully complete chains in the small 
world experiments using only local information. When individuals belong to 
groups based on a hierarchy and are more likely to interact with individuals 
within the same small group, then one can safely adopt a greedy strategy - 
pass the message on to the individual most like the target, and they will be 
more likely to know the target or someone closer to them. 

Fig. 15. Probability of two individuals corresponding by email as a function of 
the distance between their cubicles. The inset shows how many people in total 
sit at a given distance from one another. 



At the same time it is important to note that the optimum relationship 
between the probability of acquaintance and distance in physical or 
hierarchical space between two individuals, as outlined in [26, 27], is not 
satisfied. The general tendency of individuals in close physical proximity to 
correspond holds: over 87% of the 4000 email links are between individuals on 
the same floor. Still, individuals maintain disproportionately many 
far-flung contacts while not getting to know some of their close-by neighbors. 
The relationship between probability of acquaintance and cubicle distance r 
between two individuals, shown in Figure 15, is well fitted by a 1/r curve. 
However, Kleinberg has shown that the optimum relationship in two-dimensional 
space is 1/r^2, a stronger decay in probability of acquaintance than the 
1/r observed. 

In the case of HP Labs, the geometry may not be quite two dimensional, 
because it is complicated by the particular layout of the buildings. Hence the 
optimum relationship may lie between 1/r and 1/r^2. In any case, the observed 
1/r probability of linking shows a tendency consistent with Milgram's 
observations about the original small world experiment. At HP Labs, because of 
space constraints, re-organizations, and personal preferences, employees' 
cubicles may be removed from some of the co-workers they interact with. This 
hinders a search strategy relying solely on geography, because one might get 
physically quite close to the target, but still need a number of steps to find 
an individual who interacts with them. 

Fig. 16. Results of search experiments utilizing either knowledge of the target's 
position in the organizational hierarchy or the physical location of their cubicle. 

The same is true, but to a lesser extent, of the contacts individuals establish 
with respect to the organizational hierarchy. In Section 2, email spectroscopy 
revealed that while collaborations mostly occurred within the same 
organizational unit, they also frequently bridged different parts of the 
organization or broke up a single organizational unit into noninteracting 
subgroups. The optimum relationship derived in [27] for the probability of 
linking would be inversely proportional to the size of the smallest 
organizational group that both individuals belong to. However, the observed 
relationship, shown in Figure 17, is slightly off, with $p \sim g^{-3/4}$, g 
being the group size. 

Overall, the results of the email study are consistent with the model of 
Watts et al. [49]. This model does not require the search to find near optimum 
paths, but simply determines when a network is "searchable", meaning that a 
minimum fraction of messages reach the target given a rate of attrition. The 
relationship found between separation in the hierarchy and probability of 
correspondence, shown in Figure 13, is well within the searchable regime 
identified in the model. 

The study of Adamic and Adar is a first step, validating these models on a 
small scale. The email study gives a concrete way of observing how the small 
world chains can be constructed. Using a very simple greedy strategy, 
individuals across an organization could reach each other through a short chain 
of coworkers. It is quite likely that similar relationships between acquaintance 
and proximity (geographical or professional) hold true in general, and 
therefore that small world experiments succeed on a grander scale for the very 
same reasons. 

Fig. 17. Probability of two individuals corresponding by email as a function of 
the size of the smallest organizational unit they both belong to. The optimum 
relationship derived in [27] is $p \sim g^{-1}$, g being the group size. The 
observed relationship is $p \sim g^{-3/4}$. 



5 Conclusion 

In this chapter we reviewed three studies of information flow in social net- 
works. The first developed a method of analyzing email communication au- 
tomatically to expose communities of practice and their leaders. The second 
showed that the tendency of individuals to associate according to common in- 
terests influences the way that information spreads throughout a social group. 
It spreads quickly among individuals to whom it is relevant, but unlike a virus, 
is unable to infect a population indiscriminately. The third study showed why 
small world experiments work - how individuals are able to take advantage 
of the structure of social networks to find short chains of acquaintances. All 
three studies relied on email communication to expose the underlying social 
structure, which previously may have been difficult and labor-intensive to 
obtain. We expect that these findings are also valid with other means of social 
communication, such as verbal exchanges, telephony and instant messenger 
systems. 